Probabilistic Management of OCR Data using an RDBMS
Abstract
The digitization of scanned forms and documents is changing the data sources that enterprises manage. To integrate these new data sources with enterprise data, the current state-of-the-art approach is to convert the images to ASCII text using optical character recognition (OCR) software and then to store the resulting ASCII text in a relational database. The OCR problem is challenging, so the output of OCR often contains errors. In turn, queries on the output of OCR may fail to retrieve relevant answers. State-of-the-art OCR programs, e.g., the OCR powering Google Books, use a probabilistic model that captures many alternatives during the OCR process. These approaches discard the uncertainty only when the results of OCR are stored in the database. In this work, we propose to retain the probabilistic models produced by the OCR process in a relational database management system. A key technical challenge is that the probabilistic data produced by OCR software is very large (a single book grows from 400 kB as ASCII to 2 GB). As a result, a baseline solution that integrates these models with an RDBMS is over 1000x slower than standard text processing for single-table select-project queries. However, many applications may have quality-performance needs that lie between the two extremes of plain ASCII and the complete model output by the OCR software. Thus, we propose a novel approximation scheme called Staccato that allows a user to trade recall for query performance. Additionally, we provide a formal analysis of our scheme’s properties, and describe how we integrate our scheme with standard RDBMS text indexing.
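The recall-versus-cost trade-off described in the abstract can be illustrated with a toy sketch. It is not the Staccato algorithm, whose details the abstract does not give. It assumes a simplified, hypothetical model in which each word position carries a few alternative readings with probabilities, positions are independent, and the approximation keeps only the `k` most likely readings per position. The data and the names `ocr_words`, `keep_top_k`, and `match_probability` are invented for illustration.

```python
from math import prod

# Hypothetical OCR output: one list of (reading, probability) pairs per word position.
ocr_words = [
    [("Ford", 0.6), ("F0rd", 0.3), ("Fard", 0.1)],
    [("claim", 0.5), ("clairn", 0.4), ("c1aim", 0.1)],
    [("forrn", 0.55), ("form", 0.45)],
]


def keep_top_k(words, k):
    """Keep only the k most probable readings at each position (assumed approximation)."""
    return [sorted(alts, key=lambda a: a[1], reverse=True)[:k] for alts in words]


def match_probability(words, keyword):
    """Probability that at least one position reads as `keyword`,
    assuming positions are independent of one another."""
    miss = prod(
        1.0 - sum(p for text, p in alts if text == keyword)
        for alts in words
    )
    return 1.0 - miss


# k = 1 corresponds to storing only the single most likely text.
p_k1 = match_probability(keep_top_k(ocr_words, 1), "form")
# k = 2 stores more data but retains the lower-ranked reading "form".
p_k2 = match_probability(keep_top_k(ocr_words, 2), "form")
print(p_k1, p_k2)
```

In this toy data the correct word "form" is only the second-ranked reading at its position, so a query over the single best text misses it, while keeping two readings per position recovers it at the cost of storing and scanning more data.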
Similar resources
Integrating INQUERY with an RDBMS to Support Text Retrieval
Information is a combination of structured data and unstructured data. Traditionally, relational database management systems (RDBMS) have been designed to handle structured data. IR systems can handle text (unstructured data) very well but are not designed to handle structured data. With present day information being a combination of structured and unstructured data, there is an increasing dema...
Expert system for data migration between different database management systems
This paper deals with the design of a fuzzy expert system for processing data migration between different relational database management systems (RDBMS). At the beginning we identify the current state of data migration between different RDBMS. Then we propose a new approach, which suggests creating an expert system as a tool for migrating database tables and their data between different RDBMS. Th...
Transaction Management for XML Stored in Relational Database Systems
Nowadays, modern commercial relational database management systems (RDBMS) provide functionality to store, query and update XML data. One of the key problems is efficient handling of concurrent access to XML data stored in RDBMS. In this paper, we present an efficient concurrency control method for XML data stored in RDBMS. Our approach is based on the work of Grabs et al. on XMLTM, an XML transaction...
ODM BLAST: Sequence Homology Search in the RDBMS
Performing sequence homology searches against DNA or protein sequence databases is an essential bioinformatics task. Past research efforts have been primarily concerned with the development of sensitive and fast sequence homology search algorithms outside of the relational database management system (RDBMS). Oracle Data Mining (ODM) BLAST enables BLAST to be performed in an RDBMS. ODM BLAST reli...
An Indoor Positioning System Based on Wi-Fi for Energy Management in Smart Buildings
To offer indoor services to occupants in the context of smart buildings, it is necessary to consider information concerning the identity and location of the occupants. This paper proposes an indoor positioning system (IPS) based on Wi-Fi fingerprinting and the K-nearest neighbors (KNN) method. The positioning of a mobile device (MD) using Wi-Fi technology involves online and offline phases. In this...
Journal title:
- PVLDB
Volume 5, Issue
Pages -
Publication date 2011